Playback apparatus for playing back hierarchically-encoded video image data, method for controlling the playback apparatus, and storage medium

ABSTRACT

Hierarchically-encoded video image data and audio data associated with a predetermined encoded layer of the hierarchically-encoded video image data are received, and the audio-associated encoded layer with which the audio data is associated is specified from among a plurality of encoded layers of the hierarchically-encoded video image data. Then, the ratio of the field of view of a decoded video image of an encoded layer to be played back to the field of view of a decoded video image of the audio-associated encoded layer is calculated. A prestored audio correction amount is multiplied by the calculated ratio, and the resulting new audio correction amount is used to correct the audio data.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a technology for playing back hierarchically-encoded video image data in which video images of a plurality of different resolutions or fields of view are hierarchically encoded in a single video stream data as well as audio data associated with a certain encoded layer of the hierarchically-encoded video image data.

2. Description of the Related Art

In recent years, 720×480 pixel or 1440×1080 pixel resolution video images encoded using an MPEG2-Video encoding system have been broadcast by terrestrial digital broadcasting. Regarding terrestrial digital broadcasting, 320×240 pixel video images encoded using an H.264/AVC (Advanced Video Coding) encoding system have also been broadcast for mobile phones and other portable devices through a separate stream called one-segment broadcasting.

On the other hand, an H.264/SVC (Scalable Video Coding) technology capable of encoding video images of a plurality of resolutions into a single video stream data has been standardized as an extension of the H.264/AVC. According to the H.264/SVC standard, a plurality of video images of different resolutions, for example, 320×240, 720×480, and 1440×1080 pixel resolutions, are hierarchically encoded into a single video stream data as different encoded layers (also referred to as layers). Encoding video images of different resolutions into a single data stream as described above can provide higher compression and transmission efficiency as compared to cases where separate video streams are transmitted.

Moreover, according to the H.264/SVC standard, it is also possible to encode a plurality of video images of different fields of view into a single video stream data. For example, a full-frame video image showing an entire soccer ground and a video image of a specific region showing only a region that includes a soccer player in that full-frame video image are hierarchically encoded into different layers. Then, during playback, the layers are selectively decoded, making it possible to change the field of view of a video image being viewed or to play back a video image suited to the display resolution and the display aspect ratio of the display apparatus.

In this manner, by using the H.264/SVC standard, a plurality of types of display apparatuses can be supported with transmission of only a single video stream data without the need to transmit video through different streams as in the case of terrestrial digital broadcasting and one-segment broadcasting. This means that transmission band efficiency can be increased and services that enable a user to choose from among a plurality of video image sizes or fields of view can be provided, and therefore, it is envisaged that in the future, hierarchically-encoded video image data compliant with the H.264/SVC standard will be used for television broadcasting.

It should be noted that even in the case of using hierarchically-encoded video image data for television broadcasting, a situation in which only a single audio data stream is provided as in current broadcasting can also be conceived. As described above, the use of hierarchically-encoded video image data enables the user to choose a layer to change the field of view of a video image to be viewed. However, in the case where only a single audio data stream associated with one particular encoded layer is provided, a problem as described below arises. That is, the user's sense of presence may be impaired when the field of view is changed, because even when the field of view of a video image is changed, the audio stream data to be played back does not change.

Japanese Patent Laid-Open No. 2004-336430 discloses a technology for giving the sense of presence to the user by changing auditory lateralization of the audio in accordance with the clipping size or position of the video image when the field of view has been changed as a result of enlarging a part of a video image.

However, in Japanese Patent Laid-Open No. 2004-336430, playback of video data, such as hierarchically-encoded video image data, in which video images of a plurality of fields of view are hierarchically encoded was not taken into account. In the case where only a single audio data stream associated with one layer (i.e., one field of view) is provided with respect to video stream data in which video images of a plurality of fields of view are hierarchically encoded, it is required to perform processing that differs from cases where video and audio are provided in a one-to-one correspondence. For example, a case where, with respect to a video content of a soccer broadcast, a video image of the entire soccer ground and a video image of a region of interest of the soccer ground are hierarchically encoded in video stream data, and a single audio data stream is associated with the layer of the video image of the region of interest will be considered. In such a case, if processing is performed in the same manner as in conventional technologies, assuming that the audio stream is associated with the layer of the video image of the entire soccer ground, unnecessary audio correction processing will be applied when the video image of the region of interest is chosen.

Moreover, there may be a case where hierarchically-encoded video stream data contains a plurality of video images of the same resolution but different fields of view. However, such a case is not taken into account in Japanese Patent Laid-Open No. 2004-336430.

SUMMARY OF THE INVENTION

The present invention has been made in view of problems with conventional technologies, such as the problems described above. The present invention provides audio playback whereby, during playback of hierarchically-encoded video image data and audio data associated with a specific encoded layer of the hierarchically-encoded video image data, the sense of presence is given to the user even in the case where a video image of an encoded layer with which the audio data is not associated is played back.

The present invention in its first aspect provides a playback apparatus that plays back hierarchically-encoded video image data from which decoded video images of individual encoded layers can be obtained by selectively decoding the encoded layers and audio data associated with a predetermined encoded layer of the hierarchically-encoded video image data, the apparatus comprising: a receiving unit configured to receive the hierarchically-encoded video image data and the audio data; a specifying unit configured to specify the audio-associated encoded layer with which the audio data is associated, from among a plurality of encoded layers contained in the hierarchically-encoded video image data received by the receiving unit; an acquiring unit configured to acquire information on a field of view of the decoded video images that can be obtained when the plurality of encoded layers are individually decoded; a video image decoding unit configured to decode a single encoded layer chosen from among the plurality of encoded layers; an audio decoding unit configured to decode the audio data; a correcting unit configured to, in the case where the chosen single encoded layer is not the audio-associated encoded layer, correct an audio decoded by the audio decoding unit using a ratio of the field of view of a decoded video image decoded by the video image decoding unit to the field of view of a decoded video image of the audio-associated encoded layer; and a playback unit configured to play back the decoded video image decoded by the video image decoding unit and the audio corrected by the correcting unit.

The present invention in its second aspect provides a method for controlling a playback apparatus that plays back hierarchically-encoded video image data from which decoded video images of individual encoded layers can be obtained by selectively decoding the encoded layers and audio data associated with a predetermined encoded layer of the hierarchically-encoded video image data, the method comprising: a receiving step of receiving the hierarchically-encoded video image data and the audio data; a specifying step of specifying the audio-associated encoded layer with which the audio data is associated, from among a plurality of encoded layers contained in the hierarchically-encoded video image data received in the receiving step; an acquiring step of acquiring information on a field of view of the decoded video images that can be obtained when the plurality of encoded layers are individually decoded; a video image decoding step of decoding a single encoded layer chosen from among the plurality of encoded layers; an audio decoding step of decoding the audio data; a correcting step of, in the case where the chosen single encoded layer is not the audio-associated encoded layer, correcting an audio decoded in the audio decoding step using a ratio of the field of view of a decoded video image decoded in the video image decoding step to the field of view of a decoded video image of the audio-associated encoded layer; and a playback step of playing back the decoded video image decoded in the video image decoding step and the audio corrected in the correcting step.

Further features of the present invention will become apparent from the following description of exemplary embodiments with reference to the attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing a functional configuration of a set-top box according to an embodiment.

FIG. 2 is a block diagram showing a functional configuration of an audio correction control unit of the set-top box.

FIG. 3 is a diagram for illustrating the configuration of content data.

FIG. 4 is a diagram for illustrating a component descriptor.

FIG. 5 is a diagram for illustrating additional information data.

FIG. 6 is a flowchart of audio correction control processing according to the embodiment.

FIG. 7 is a diagram showing an example of the configuration of an audio correction amount information table.

FIG. 8 is a diagram for illustrating decoded video images expressed by individual encoded layers of hierarchically-encoded video image data.

DESCRIPTION OF THE EMBODIMENTS

Hereinafter, a preferred and exemplary embodiment of the present invention will be described in detail with reference to the drawings. In the embodiment below, an example in which the present invention is applied to a set-top box that serves as an example of a playback apparatus which is capable of playing back hierarchically-encoded video image data and audio data associated with a single encoded layer of the hierarchically-encoded video image data will be described. However, the present invention is applicable to any device that is capable of playing back hierarchically-encoded video image data and audio data associated with a predetermined encoded layer of the hierarchically-encoded video image data.

In the present embodiment, as will be described later, an encoded layer (“default_layer”) that is contained in the hierarchically-encoded video image data which is to be chosen by default during playback on a playback apparatus is predetermined. In the description below, it is assumed that the audio data is associated with the encoded layer that is to be chosen by default among a plurality of encoded layers of the hierarchically-encoded video image data. As described above, a video image is encoded for each encoded layer, and decoding of a certain encoded layer provides a decoded video image of corresponding resolution and field of view.

In the following description, the H.264/SVC standard is assumed as an example of a hierarchical encoding method. According to this standard, when decoding a certain encoded layer, data of an encoded layer subordinate to the certain encoded layer may be utilized. In the following description, a phrase “decoding a certain encoded layer as the most superordinate layer” assumes a case such as this. However, in the present specification, “decode a certain encoded layer” means ultimately obtaining a decoded video image of that encoded layer, and whether or not data of other encoded layers are used in the process of decoding is irrelevant to the present invention. Therefore, it should be noted that the present invention is not limited to a specific hierarchical encoding method.

In the present embodiment, audio data is associated with a certain encoded layer of the hierarchically-encoded video image data. Moreover, it is assumed that the audio data was recorded at a position corresponding to a video image that is obtained by decoding the encoded layer with which the audio data is associated. That is to say, in the case where audio data is associated with a full size video image captured at a position where the distance to a subject is 50 m, the audio data corresponds to sound that is actually heard by a person at that imaged position (where the distance to the subject is 50 m) when capturing the video image. A full size video image captured at a certain position is a video image equivalent to the human field of vision at that position. For example, in the case where a camera captures an image with a lens having a field of view different from the human field of view, such as a telephoto lens, a desirable audio acquired position for recording audio data is a camera position (a converted imaged position) that would have been used if the captured video image had been captured with a lens having a field of view equivalent to the human field of view. Thus, in the present embodiment, it is assumed that sound recorded at a position 100 m away from the subject is not associated with a decoded video image captured in such a manner that the distance to the subject is equivalent to 10 m. Although it is desirable that the imaged position (including the converted imaged position) and the audio acquired position are the same position, the present embodiment will be described assuming that even when the audio acquired position differs from the imaged position, substantially the same sound as sound that can be heard at the imaged position can be recorded, as long as those positions are at the same distance to the subject.

In the following description, an encoded layer with which the audio data is associated will be referred to as an audio-associated encoded layer and distinguished from an encoded layer with which no audio data is associated. Moreover, in the following description, a data stream containing hierarchically-encoded video image data, audio data, and additional information data will be referred to as content data. It is assumed that information on the field of view or the resolution of a decoded video image that is obtained for each encoded layer can be acquired from the content data.

FIG. 1 is a block diagram showing a functional configuration of a set-top box according to an embodiment of the present invention. A set-top box 100 includes a playback control unit 101, a content acquiring unit 102, a content analysis unit 103, a video image processing unit 104, an audio processing unit 105, an operation receiving unit 106, an audio correction control unit 107, and an audio output adjustment unit 108. Content data transmitted via radio wave or a network is input to the set-top box 100 through an antenna input or a network input.

The content acquiring unit 102, based on an instruction from the playback control unit 101, applies demodulation processing and error correction processing to a signal transmitted via radio wave or a network and outputs received content data to the content analysis unit 103.

The content analysis unit 103, based on an instruction from the playback control unit 101, analyzes the content data that has been input from the content acquiring unit 102. Then, the content analysis unit 103 separates hierarchically-encoded video image data, audio data, and additional information data that are multiplexed in the content data. The content analysis unit 103 outputs the separated hierarchically-encoded video image data to the video image processing unit 104, the audio data to the audio processing unit 105, and the additional information data to the playback control unit 101.

The video image processing unit 104, based on a playback instruction from the playback control unit 101, selectively decodes a plurality of encoded layers of the hierarchically-encoded video image data, with an encoded layer designated by the playback instruction regarded as the most superordinate layer, and outputs the resultant decoded video image to an externally connected display apparatus (a display).

The audio processing unit 105, based on a playback instruction from the playback control unit 101, decodes (audio decoding) the audio data that has been separated by the content analysis unit 103 and input to the audio processing unit 105, and outputs the resultant audio to the audio output adjustment unit 108.

The audio output adjustment unit 108, based on an audio output control instruction from the audio correction control unit 107, performs audio quality adjustment with respect to surround, bass, and treble of the audio output by the audio processing unit 105 and processing for correcting auditory lateralization for each audio output channel, such as L and R. Then, the audio output adjustment unit 108 outputs the corrected audio to an externally connected audio output apparatus (a speaker).

The operation receiving unit 106 is, for example, an infrared receiving unit, and receives a user operation signal such as a remote controller signal (key code data) from a remote controller (not shown) that has been operated by the user. Then, the operation receiving unit 106 analyzes the received user operation signal and outputs the content of the operation to the playback control unit 101. Specifically, the operation receiving unit 106 converts the user operation signal to a signal related to playback control, such as the start of playback, information for choosing each layer of the hierarchically-encoded video image data and the like, and outputs the resultant signal to the playback control unit 101.

The playback control unit 101 instructs, in accordance with the playback control signal input from the operation receiving unit 106, the content acquiring unit 102 to receive the content data. Furthermore, the playback control unit 101 instructs the content analysis unit 103 to analyze the content data input from the content acquiring unit 102, and acquires the additional information data that has been analyzed by the content analysis unit 103. Moreover, the playback control unit 101 determines an encoded layer to be decoded, of the hierarchically-encoded video image data, and the audio data in accordance with the additional information data input from the content analysis unit 103 and the playback control signal input from the operation receiving unit 106. Then, the playback control unit 101 instructs the content analysis unit 103 to output the hierarchically-encoded video image data to be played back and the audio data to the video image processing unit 104 and the audio processing unit 105, respectively. Subsequently, the playback control unit 101 instructs the video image processing unit 104 to play back the decoded video image of the encoded layer that has been determined to be decoded.

Moreover, the playback control unit 101, in accordance with the additional information data input from the content analysis unit 103 and the playback control signal output from the operation receiving unit 106, instructs the audio processing unit 105 to decode and play back the audio data. The playback control unit 101, in accordance with the additional information data input from the content analysis unit 103 and the playback control signal output from the operation receiving unit 106, outputs an audio correction instruction along with the additional information data to the audio correction control unit 107.

The audio correction control unit 107, in accordance with the additional information data and the audio correction instruction input from the playback control unit 101, determines adjustment of audio quality, such as surround, bass, and treble, related to audio playback and the amount of correction of auditory lateralization for each audio output channel, and instructs the audio output adjustment unit 108 to perform audio output adjustment.

Here, processing performed by the audio correction control unit 107 of the set-top box 100 will be described in greater detail using FIG. 2. The audio correction control unit 107 includes a video-audio association judgement unit 201, a playback video image's field of view ratio judgement unit 202, an audio correction amount determination unit 203, and an audio correction information holding unit 204.

Information regarding playback control, such as a playback status and the additional information data containing structural information of each encoded layer of the hierarchically-encoded video image data, is input to the video-audio association judgement unit 201 from the playback control unit 101. The video-audio association judgement unit 201 specifies which encoded layer the audio data is associated with, based on the input structural information of each encoded layer. Then, the video-audio association judgement unit 201 outputs the specified audio-associated encoded layer and information on each encoded layer contained in the additional information data to the playback video image's field of view ratio judgement unit 202.

The playback video image's field of view ratio judgement unit 202 calculates a ratio between fields of view of a decoded video image of the audio-associated encoded layer and a decoded video image of the encoded layer to be played back, and outputs the calculation result to the audio correction amount determination unit 203.

The audio correction amount determination unit 203 determines an audio correction amount for the audio data to be instructed to the audio output adjustment unit 108 using the field of view ratio that has been calculated by the playback video image's field of view ratio judgement unit 202. Specifically, the audio correction amount determination unit 203 references the audio correction information holding unit 204 holding an audio correction information table that is based on genres of content data, and acquires an audio correction amount associated with the genre of the currently received content data. Then, the audio correction amount determination unit 203 multiplies the audio correction amount that has been acquired from the audio correction information holding unit 204 by the field of view ratio to obtain a new audio correction amount, and outputs the obtained audio correction amount to the audio output adjustment unit 108.
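The lookup-and-multiply operation described above can be sketched as follows. This is only an illustrative sketch: the table contents, the genre keys, and the function name determine_audio_correction are assumptions introduced for the example and are not values taken from the embodiment.

```python
# Hypothetical stand-in for the audio correction information table held by
# the audio correction information holding unit 204. The numeric values are
# placeholders chosen for illustration only.
AUDIO_CORRECTION_TABLE = {
    "sport": 1.0,
    "news": 0.5,
    "drama": 0.25,
}


def determine_audio_correction(genre: str, fov_ratio: float) -> float:
    """Return the new audio correction amount for the given genre.

    The prestored amount for the genre is multiplied by the field of view
    ratio supplied by the field of view ratio judgement unit 202.
    """
    base_amount = AUDIO_CORRECTION_TABLE[genre]
    return base_amount * fov_ratio
```

For instance, under these placeholder values a "news" program with a field of view ratio of −2 would yield a correction amount of −1.0.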

FIG. 3 is a diagram showing an example of the configuration of the content data that is output from the content acquiring unit 102 shown in FIG. 1 and that contains the hierarchically-encoded video image data. Here, an example in which the content data is configured as a TS signal defined by the IEC (International Electrotechnical Commission), the IEEE (Institute of Electrical and Electronics Engineers), and the like is shown.

As shown in FIG. 3, the signal received by the content acquiring unit 102 contains a plurality of TS packets that are time division multiplexed, forming a TS signal. A “video” portion corresponding to the hierarchically-encoded video image data, an “audio” portion corresponding to the audio data, and a “data” portion corresponding to the additional information data are independently received in TS packet units. The content analysis unit 103 analyzes such TS signal data, separates the TS signal data into the “video” portion, the “audio” portion, and the “data” portion, and outputs the “video” portion to the video image processing unit 104 and the “audio” portion to the audio processing unit 105. Moreover, the content analysis unit 103 outputs the result of analysis of the “video” portion, the “audio” portion, and the “data” portion to the playback control unit 101.
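The separation performed by the content analysis unit 103 can be pictured with the following minimal sketch. It relies on the general MPEG-2 transport stream layout (188-byte packets beginning with the sync byte 0x47 and carrying a 13-bit packet identifier, or PID). The function name split_by_pid and the PID-to-portion mapping passed to it are assumptions for illustration; in an actual receiver the mapping would be derived from tables such as the PAT and PMT.

```python
TS_PACKET_SIZE = 188  # fixed length of an MPEG-2 TS packet in bytes
SYNC_BYTE = 0x47      # first byte of every TS packet


def split_by_pid(stream: bytes, pid_to_portion: dict) -> dict:
    """Group TS packets into named portions according to their PID.

    pid_to_portion is a hypothetical mapping such as
    {0x100: "video", 0x101: "audio", 0x12: "data"}.
    Packets with an unknown PID or a bad sync byte are skipped.
    """
    portions = {name: [] for name in set(pid_to_portion.values())}
    last_start = len(stream) - TS_PACKET_SIZE
    for start in range(0, last_start + 1, TS_PACKET_SIZE):
        packet = stream[start:start + TS_PACKET_SIZE]
        if packet[0] != SYNC_BYTE:
            continue
        # The PID is the low 5 bits of byte 1 followed by all of byte 2.
        pid = ((packet[1] & 0x1F) << 8) | packet[2]
        name = pid_to_portion.get(pid)
        if name is not None:
            portions[name].append(packet)
    return portions
```

The sketch only classifies whole packets; reassembly of the payloads into elementary streams and section tables is omitted.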

Here, a management information table that is reconstructed by collecting a plurality of “data” portions is composed of a PAT (Program Association Table), a PMT (Program Map Table), an EIT (Event Information Table), an NIT (Network Information Table), and the like. The EIT indicates a program that is currently being received and a subsequent program schedule, and a table ID, a service ID, an event ID, a broadcast start time of the program, a broadcast duration, and the like are described in an initial portion, which is followed by some descriptors.

The content analysis unit 103 performs an analysis of what kind of program is being received based on the information of this EIT. Now, based on the EIT information, basic data (field) information and various types of descriptors, which are related to judgement of whether the currently received content data is a program whose display region can be changed and what kind of display region can be chosen, will be described.

First, the table ID is information by which the table is identified as the EIT. The program can be identified by the event ID that is described after the table ID. Moreover, a start time and a broadcast duration for each program are described, and it is possible to judge when a program finishes by adding the duration to the start time.
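The end-time judgement mentioned above amounts to a simple addition, as the following sketch shows. The function name program_end_time is hypothetical, and the start time and duration are assumed to have already been decoded from the EIT fields into Python datetime and timedelta values.

```python
from datetime import datetime, timedelta


def program_end_time(start_time: datetime, duration: timedelta) -> datetime:
    """Return the time at which a program finishes (start time + duration)."""
    return start_time + duration
```

For example, a program starting at 19:00 with a duration of 90 minutes would be judged to finish at 20:30 on the same day.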

The descriptors contained in the EIT will be described below. Examples of the descriptors include a component descriptor indicating, for example, information on the resolution and the aspect ratio of a video image, a single event descriptor indicating a program name, and an event group descriptor in which event relay information regarding, for example, which service the rest of the program will be broadcast on is described. In the content descriptor shown in FIG. 3, “news”, “sport”, “drama”, “variety”, “education”, or the like is described as a program genre.

FIG. 4 is an example of the additional information data in which encoded layer information of the hierarchically-encoded video image data is described using the above-described component descriptor.

Information on the hierarchically-encoded video image data is described in the component descriptor. For example, resolution information, that is, 480i in the case of 0x01 and 1080i in the case of 0xB1, and the like are described in a “component_type” identifier. Moreover, the component descriptor contains a “default_layer” identifier underlined in FIG. 4 as information indicating which encoded layer is to be chosen and decoded by default at the start of playback of the content data. If the value of the “default_layer” identifier is 1, a first encoded layer, that is, a base layer is the encoded layer to be chosen by default. If the value of the “default_layer” identifier is 2, a second encoded layer, that is, a first enhancement layer is the encoded layer to be chosen by default. That is to say, by designating a layer ID in the “default_layer” identifier, it is possible to determine an encoded layer to be initially displayed in the case where there are a plurality of layer structures and a plurality of fields of view in the hierarchically-encoded video image data.
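The interpretation of these two identifiers can be sketched as below. Only the values stated in the description (0x01, 0xB1, and “default_layer” values 1 and 2) come from the text; the names COMPONENT_TYPE_RESOLUTION and default_layer_name, and the extension of the pattern to “default_layer” values above 2, are assumptions made for the example.

```python
# Resolution codes mentioned in the description; other codes are not listed
# there and are deliberately left out.
COMPONENT_TYPE_RESOLUTION = {
    0x01: "480i",
    0xB1: "1080i",
}


def default_layer_name(default_layer: int) -> str:
    """Describe the encoded layer selected by a "default_layer" value.

    1 denotes the base layer and 2 the first enhancement layer; treating
    n > 2 as enhancement layer n - 1 is an assumed extrapolation.
    """
    if default_layer < 1:
        raise ValueError("default_layer must be 1 or greater")
    if default_layer == 1:
        return "base layer"
    return f"enhancement layer {default_layer - 1}"
```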

Here, the configuration of a hierarchically-encoded video image will be described using FIG. 5. FIG. 5 shows an example of information on encoded layers of hierarchically-encoded video image data compliant with the H.264/SVC standard. The information on the encoded layers is composed of an SPS (Sequence Parameter Set) having information regarding overall encoding, a PPS (Picture Parameter Set) regarding video image encoding, and an AU (Access Unit), which is actual video image data, and the respective pieces of information are subdivided into information for each layer.

In addition to video image resolution information “video_format” of each encoded layer, offset information, that is, “top_offset”, “left_offset”, “right_offset”, and “bottom_offset” are described in the SPS as information on an offset between encoded layers. That is to say, this offset information enables judgement of the difference in the field of view between encoded layers. For example, if each offset value is 10, it is possible to judge that a decoded video image having a field of view larger than that of a reference layer by 10 pixels in each direction of upward, downward, leftward, and rightward can be obtained. It should be noted that the offset information refers to a value representing a position at which, when decoded video images of respective encoded layers are set so that they have the same spatial resolution, a region of the subject expressed in the decoded video image of an encoded layer is displayed in the decoded video image of another encoded layer. That is to say, the offset information is a value representing the position at which a region of the subject expressed in the decoded video image of an encoded layer is displayed in the decoded video image of another encoded layer using the number of pixels from the four sides of the decoded video image of the other encoded layer.
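As a rough illustration of how the offset information relates the fields of view of two layers, the sketch below computes the size of the display region of a layer from the size of the reference layer and the four offsets, with both layers expressed at the same spatial resolution. The class LayerOffsets and the function region_size are hypothetical names, and the sign convention (a positive offset enlarges the field of view on that side, as in the 10-pixel example above) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerOffsets:
    """Offsets between two encoded layers, in pixels at a common resolution."""
    top_offset: int
    left_offset: int
    right_offset: int
    bottom_offset: int


def region_size(ref_width: int, ref_height: int, offsets: LayerOffsets):
    """Return (width, height) of the region covered relative to the reference.

    Assumes positive offsets extend the field of view beyond the reference
    layer on the corresponding side.
    """
    width = ref_width + offsets.left_offset + offsets.right_offset
    height = ref_height + offsets.top_offset + offsets.bottom_offset
    return width, height
```

With a 320×240 reference layer and every offset equal to 10, the sketch gives a 340×260 region, that is, 10 pixels more in each direction.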

Audio Correction Control Processing

Hereinafter, specific processing of the audio correction control processing of the set-top box 100 according to the present embodiment having the configuration as described above will be described further using the flowchart in FIG. 6. The processing corresponding to this flowchart can be realized by the playback control unit 101 reading out a corresponding processing program stored in, for example, a nonvolatile memory (not shown), expanding the program in a RAM (not shown), and executing the program. It should be noted that this audio correction control processing can be started when the operation receiving unit 106 has received a request to start playback of the content data from, for example, a remote controller operated by the user, and the following description is based on the assumption that this audio correction control processing is repeatedly executed during playback of the content data. Specifically, once the content data playback request from the user is input from the operation receiving unit 106, the playback control unit 101 starts playback processing by instructing each block of the set-top box 100 to play back the content data and also starts this audio correction control processing.

In step S601, the playback control unit 101 inputs program information contained in the additional information data of the content data that is being received to the audio correction control unit 107 and causes the audio correction control unit 107 to specify an audio-associated encoded layer with which the audio data contained in the currently received content data is associated. Specifically, the video-audio association judgement unit 201 of the audio correction control unit 107 specifies that the audio data is associated with an encoded layer to be displayed by default from the information of the “default_layer” described in the component descriptor.

In step S602, the playback control unit 101 judges whether or not there is a difference in the field of view between a decoded video image that is obtained by decoding the audio-associated encoded layer that has been specified by the audio correction control unit 107 and a decoded video image that is obtained by decoding a currently chosen encoded layer. Specifically, the playback control unit 101 makes judgement by referencing the offset information on the offset between layers of the SPS, of the additional information data contained in the content data received from the content analysis unit 103. For example, if at least one of “left_offset”, “right_offset”, “top_offset”, and “bottom_offset”, which are the offset information on an offset between the audio-associated encoded layer and the currently chosen encoded layer, takes a value other than 0, the playback control unit 101 judges that there is a change in the field of view. The playback control unit 101 advances the processing to step S603 if there is a difference in the field of view between the decoded video image of the audio-associated encoded layer and the decoded video image of the currently chosen encoded layer, and advances the processing to step S605 if there is no difference.
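The judgement and branching of step S602 can be sketched as follows. The function names and the use of a dictionary to carry the four offsets are assumptions for illustration; only the offset names and the step labels S603 and S605 come from the description.

```python
OFFSET_KEYS = ("left_offset", "right_offset", "top_offset", "bottom_offset")


def has_field_of_view_difference(offsets: dict) -> bool:
    """Return True if at least one of the four offsets is other than 0.

    offsets is assumed to hold the offset information between the
    audio-associated encoded layer and the currently chosen encoded layer.
    A missing key is treated as 0.
    """
    return any(offsets.get(key, 0) != 0 for key in OFFSET_KEYS)


def next_step_after_s602(offsets: dict) -> str:
    """Return the label of the step to which processing advances."""
    return "S603" if has_field_of_view_difference(offsets) else "S605"
```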

In step S603, the playback control unit 101 causes the audio correction control unit 107 to calculate the ratio between the fields of view of the respective decoded video images obtained by decoding the encoded layer to be played back and the encoded layer to be displayed by default among a plurality of encoded layers contained in the hierarchically-encoded video image data. Specifically, the playback control unit 101 causes the audio correction control unit 107 to calculate the ratio between the fields of view of the decoded video image obtained by decoding the audio-associated encoded layer and the decoded video image obtained by decoding the encoded layer to be played back.

First, the playback video image's field of view ratio judgement unit 202 of the audio correction control unit 107 acquires, from the additional information data of the audio-associated encoded layer and the encoded layer to be played back, the fields of view of the respective decoded video images that are obtained when these encoded layers are decoded. Then, the playback video image's field of view ratio judgement unit 202 calculates the field of view ratio from the information on the fields of view of the audio-associated encoded layer and the encoded layer to be played back. In the present embodiment, the field of view ratio is not a ratio between the resolutions of the respective decoded video images, but is defined as the square root of the ratio between the areas of display regions in the case where the decoded video images are converted so that they have the same spatial resolution. Moreover, since the field of view ratio is used for audio adjustment, if the calculated ratio is a number smaller than 1, the negative reciprocal of the calculated ratio is used as the field of view ratio. For example, if the calculated field of view ratio is ½ times, the playback video image's field of view ratio judgement unit 202 outputs “−1/(½)=−2” to the audio correction amount determination unit 203 as the field of view ratio.
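The sign convention described above can be illustrated with a minimal sketch. The function name `signed_field_of_view_ratio` is hypothetical; the rule itself (a ratio smaller than 1 is replaced by its negative reciprocal) is the one stated in the present embodiment.

```python
def signed_field_of_view_ratio(raw_ratio: float) -> float:
    """Apply the negative-reciprocal rule to a raw field of view ratio.

    raw_ratio is the square root of the display-area ratio and must be
    positive. Ratios of 1 or more are returned unchanged; ratios below 1
    are replaced by their negative reciprocal, e.g. 1/2 becomes -2.
    """
    if raw_ratio <= 0:
        raise ValueError("the field of view ratio must be positive")
    if raw_ratio < 1:
        return -1.0 / raw_ratio
    return raw_ratio


narrower = signed_field_of_view_ratio(0.5)  # chosen layer has half the field of view
wider = signed_field_of_view_ratio(2.0)     # chosen layer has twice the field of view
```

With this convention, a narrower field of view yields a negative multiplier, so the sign of each stored correction amount is inverted when the multiplication of step S604 is applied.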

Here, a specific example of the method for calculating the field of view ratio will be described using FIG. 8. The example in FIG. 8 shows a positional relationship among decoded video images in the case where the encoded layers of hierarchically-encoded video image data that has a base layer compatible with the H.264/AVC standard and two enhancement layers serving as superordinate layers that are extensions of the base layer have been decoded. In the present embodiment, it is assumed that decoded video images obtained by decoding encoded layers all have the same spatial resolution.

First, the offset information on the offset between layers in the SPS, which is part of the additional information data contained in the content data, is referenced. It is assumed that in the case where the component descriptor describes that the “default_layer” is an enhancement layer 1 and the encoded layer that is currently chosen to be played back is an enhancement layer 2, the offset information of the enhancement layer 2 with respect to the enhancement layer 1 is as follows:

left_offset: 480

right_offset: 480

top_offset: 270

bottom_offset: 270

At this time, since the display resolution of the decoded video image of the enhancement layer 1 is 960×540, the ratios of the horizontal and vertical resolutions of the decoded video image of the enhancement layer 2 to those of the enhancement layer 1 serving as a reference are:

$\frac{\text{horizontal resolution of enhancement layer 1} + \text{left\_offset} + \text{right\_offset}}{\text{horizontal resolution of enhancement layer 1}} = \frac{960 + 480 + 480}{960} = 2$

$\frac{\text{vertical resolution of enhancement layer 1} + \text{top\_offset} + \text{bottom\_offset}}{\text{vertical resolution of enhancement layer 1}} = \frac{540 + 270 + 270}{540} = 2$

Consequently, the ratio between the areas of display regions is 2×2=4 times. The ratio of the field of view of the enhancement layer 2 to that of the enhancement layer 1 is calculated as the square root of this area ratio and found to be 2 times.
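The calculation of this example can be reproduced with a short sketch. The function name `field_of_view_ratio` and its parameter names are hypothetical; the sketch assumes, as in the present embodiment, that all decoded video images have the same spatial resolution, so that the offsets can be added directly to the resolution of the reference layer.

```python
import math


def field_of_view_ratio(ref_width: int, ref_height: int,
                        left_offset: int, right_offset: int,
                        top_offset: int, bottom_offset: int) -> float:
    """Square root of the display-area ratio of the chosen layer to the reference layer."""
    horizontal_ratio = (ref_width + left_offset + right_offset) / ref_width
    vertical_ratio = (ref_height + top_offset + bottom_offset) / ref_height
    area_ratio = horizontal_ratio * vertical_ratio
    return math.sqrt(area_ratio)


# Enhancement layer 1 (960x540) as the reference, with the offsets of enhancement layer 2.
ratio = field_of_view_ratio(960, 540, 480, 480, 270, 270)
```

For the values above, the horizontal and vertical ratios are both 2, the area ratio is 4, and `ratio` evaluates to 2.0.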

In step S604, the playback control unit 101 causes the audio correction control unit 107 to determine the audio correction amount appropriate for the genre of the content data that is being received and the decoded video image of the currently chosen encoded layer. Specifically, first, the audio correction amount determination unit 203 of the audio correction control unit 107 acquires an audio correction amount associated with the genre of the currently received content data from the audio correction amount information table held by the audio correction information holding unit 204. The information on the genre of the content data is acquired from the content descriptor contained in the additional information data of the received content data. FIG. 7 shows an example of the audio correction amount information table; the audio correction information holding unit 204 holds audio correction amounts regarding treble, bass, and surround (spatial impression of sound) for each genre of content data as the audio correction amount information table. The audio correction amount determination unit 203 of the audio correction control unit 107 then multiplies the audio correction amounts of the respective parameters associated with the content data that is being received by the field of view ratio calculated in step S603, thereby deciding audio correction amounts that reflect the difference in the field of view.

For example, consider a case where, as shown in FIG. 8, the content data is a soccer broadcast (the genre is “sport”), the audio-associated encoded layer is the enhancement layer 1, and the currently chosen encoded layer to be played back is the enhancement layer 2.

The enhancement layer 2 has a wider field of view than the enhancement layer 1, and the soccer ground is viewed from far away. Therefore, audio associated with the decoded video image of the enhancement layer 2 has a smaller effect of giving the sense of presence than audio in the center of the soccer ground of the enhancement layer 1. Specifically, processing is performed for correcting the audio so that the volume of a bass portion such as the sound of a ball bouncing in the soccer ground decreases, the volume of a treble portion such as the sound of an audience or the sound traveling throughout the soccer ground increases, and the spatial impression of sound is improved. That is to say, the audio correction amounts of respective adjustment parameters acquired from the audio correction amount information table are multiplied by the field of view ratio (2 times) calculated in step S603, and

the treble portion: 20%×2 times=40%,

the bass portion: −20%×2 times=−40%, and

the surround: 20%×2 times=40%

are the finally determined audio correction amounts.
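The determination of step S604 for this example can be sketched as follows. The table `AUDIO_CORRECTION_TABLE` and the function `determine_correction_amounts` are hypothetical names; only the “sport” row (treble 20%, bass −20%, surround 20%) is taken from the example above, and any further genres of FIG. 7 are deliberately omitted rather than guessed.

```python
# Correction amounts in percent per genre. Only the "sport" entry is taken
# from the worked example; the table layout itself is an assumption.
AUDIO_CORRECTION_TABLE = {
    "sport": {"treble": 20, "bass": -20, "surround": 20},
}


def determine_correction_amounts(genre: str, fov_ratio: float) -> dict:
    """Multiply each genre-specific correction amount by the field of view ratio."""
    base_amounts = AUDIO_CORRECTION_TABLE[genre]
    return {parameter: amount * fov_ratio for parameter, amount in base_amounts.items()}


corrected = determine_correction_amounts("sport", 2)
# corrected == {"treble": 40, "bass": -40, "surround": 40}
```

Passing a ratio of 1 returns the stored amounts unchanged, which corresponds to the case in which no field of view difference needs to be reflected.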

Even in the case where there is no difference in the field of view between the decoded video image of the audio-associated encoded layer and the decoded video image of the currently chosen encoded layer, in step S605, the playback control unit 101 causes the audio correction control unit 107 to determine the audio correction amounts appropriate for the genre of the content data. It should be noted that since it is not necessary to perform audio correction that reflects the difference in the field of view in this step, the audio correction amount determination unit 203 of the audio correction control unit 107 can use the audio correction amounts associated with the genre of the content data, which have been acquired from the audio correction information holding unit 204, as the audio correction amounts as they are.

Then, in step S606, the playback control unit 101 causes the audio output adjustment unit 108 to apply correction to the audio decoded by the audio processing unit 105 using the audio correction amounts that have been determined by the audio correction control unit 107. The corrected audio is output from the audio output adjustment unit 108 to an audio playback apparatus, such as a speaker, connected to the set-top box 100.

It should be noted that in the present embodiment, an audio correction method that uses the audio correction amounts obtained by multiplying the audio correction amounts associated with the genre of the content data by the ratio of the field of view of the decoded video image of the currently chosen encoded layer to the field of view of the decoded video image of the audio-associated encoded layer has been described. However, according to the present invention, it is also possible to create the sense of presence of audio due to a difference in the field of view by simply multiplying the audio by the field of view ratio without using the audio correction amounts associated with the genre of the content data.

Moreover, the above embodiment has been described assuming that the audio data was recorded at a position corresponding to a decoded video image that is obtained by decoding the encoded layer with which the audio data is associated. However, according to the present invention, a case where there is a difference between the distance from the position at which the audio data was recorded to the subject and the distance from the imaged position corresponding to the video image obtained by decoding the encoded layer with which the audio data is associated to the subject can also be supported. That is to say, in the case where there is a difference between the imaged position and the audio acquired position, correction processing for correcting sound localization so that the audio sounds as if it were recorded at the imaged position can be performed as processing preliminary to the processing of the above-described embodiment. With this correction processing, the audio is first corrected so that it sounds as if it were recorded at the imaged position, and then the control described in the above embodiment can be applied to the corrected audio data. Many techniques for correcting sound localization or an acoustic image have long been known, and so a correction technique suitable for carrying out the present invention can be applied.

Moreover, in the above-described embodiment, adjustment has been performed with respect to the treble, bass, and surround of the audio. However, it is also possible to judge a shift of auditory lateralization based on an offset value between video image layers and correct the auditory lateralization in accordance with the shift.

Moreover, in the above embodiment, a case where the content data contains one audio data set associated with one encoded layer has been described in order to facilitate understanding and description. However, the present invention is not limited to the case where there is only one audio data set and is applicable to a case where the content data provides no audio data associated with an encoded layer to be played back. For example, in the case where two audio data sets are provided, an audio data set associated with an encoded layer having a field of view closer to the field of view of the decoded video image of the encoded layer to be played back can be used.
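One possible way of choosing between several audio data sets is sketched below. The embodiment does not specify how closeness between fields of view is measured; this sketch assumes, purely for illustration, that each field of view is given as a (width, height) pair at a common spatial resolution and that closeness is the absolute logarithm of the area ratio. The names `fov_distance` and `choose_audio_layer` are hypothetical.

```python
import math


def fov_distance(fov_a: tuple, fov_b: tuple) -> float:
    """Assumed closeness measure: |log(area_a / area_b)|, 0 for equal areas."""
    area_a = fov_a[0] * fov_a[1]
    area_b = fov_b[0] * fov_b[1]
    return abs(math.log(area_a / area_b))


def choose_audio_layer(playback_fov: tuple, audio_layer_fovs: dict) -> str:
    """Return the name of the audio-associated layer whose field of view is closest."""
    return min(
        audio_layer_fovs,
        key=lambda name: fov_distance(audio_layer_fovs[name], playback_fov),
    )


# Two audio data sets, associated with a 480x270 layer and a 960x540 layer,
# while a 1920x1080 layer is chosen for playback (illustrative values).
chosen = choose_audio_layer(
    (1920, 1080),
    {"base layer": (480, 270), "enhancement layer 1": (960, 540)},
)
```

Here `chosen` is "enhancement layer 1", because its display area differs from that of the playback layer by a factor of 4 rather than 16. The audio data of the chosen layer would then be corrected using the field of view ratio as in the above embodiment.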

Moreover, the above embodiment has been described assuming that the audio-associated encoded layer with which the audio data is associated is the encoded layer to be played back by default, which is described in the additional information data contained in the content data. However, implementation of the present invention is not limited to the case where the audio-associated encoded layer is described in the additional information data, and it is also possible to determine the audio-associated encoded layer from, for example, the information of each encoded layer of the hierarchically-encoded video image data. For example, in the case of such content data that a full-frame video image and a video image obtained by clipping a predetermined region of the full-frame video image are hierarchically encoded, an encoded layer into which the full-frame video image is encoded may be determined as the audio-associated encoded layer. In other words, out of a plurality of encoded layers of the hierarchically-encoded video image data, an encoded layer from which a decoded video image having the widest field of view can be obtained may be used as the audio-associated encoded layer. That is to say, since the hierarchically-encoded video image data is composed of full-frame video images that have been captured at a single imaged position, the audio-associated encoded layer may be determined based on judgement that such content data contains audio data recorded at an acquired position corresponding to the single imaged position.

As described above, the playback apparatus of the present embodiment is capable of audio data playback that gives the sense of presence to the user by correcting audio data associated with a predetermined encoded layer of hierarchically-encoded video image data so as to suit the field of view of a decoded video image of an encoded layer to be played back. Specifically, the playback apparatus receives hierarchically-encoded video image data and audio data associated with a predetermined encoded layer of the hierarchically-encoded video image data. Furthermore, the playback apparatus specifies the audio-associated encoded layer with which the audio data is associated, from among a plurality of encoded layers of the hierarchically-encoded video image data. Then, the ratio of the field of view of a decoded video image of an encoded layer to be played back to the field of view of a decoded video image of the audio-associated encoded layer is calculated. Furthermore, a prestored audio correction amount is multiplied by the calculated ratio to obtain a new audio correction amount, and the new audio correction amount is used to correct the audio data.

In this manner, even in the case where no audio is associated with a chosen encoded layer, audio giving the sense of presence suited to the chosen encoded layer can be presented to the user.

Other Embodiments

Aspects of the present invention can also be realized by a computer of a system or apparatus (or devices such as a CPU or MPU) that reads out and executes a program recorded on a memory device to perform the functions of the above-described embodiment, and by a method, the steps of which are performed by a computer of a system or apparatus by, for example, reading out and executing a program recorded on a memory device to perform the functions of the above-described embodiment. For this purpose, the program is provided to the computer, for example, via a network or from a recording medium of various types serving as the memory device (e.g., computer-readable medium).

While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

This application claims the benefit of Japanese Patent Application No. 2010-137677, filed Jun. 16, 2010, which is hereby incorporated by reference herein in its entirety.

What is claimed is:
 1. A playback apparatus that plays back hierarchically-encoded video image data from which decoded video images of individual encoded layers can be obtained by selectively decoding the encoded layers and audio data associated with a predetermined encoded layer of the hierarchically-encoded video image data, the apparatus comprising: at least one processor; and a memory which is coupled to the at least one processor and stores instructions which cause the at least one processor to perform operations of following units of the playback apparatus: a receiving unit which receives the hierarchically-encoded video image data and the audio data; a specifying unit which specifies the audio-associated encoded layer with which the audio data is associated, from among a plurality of encoded layers contained in the hierarchically-encoded video image data received by the receiving unit; an acquiring unit which acquires information on a field of view of the decoded video images that can be obtained when the plurality of encoded layers are individually decoded; a video image decoding unit which decodes a encoded layer chosen from among the plurality of encoded layers; an audio decoding unit which decodes the audio data; a correcting unit which, in the case where the chosen encoded layer is not the audio-associated encoded layer, corrects an audio decoded by the audio decoding unit using a ratio of the field of view of a decoded video image decoded by the video image decoding unit to the field of view of a decoded video image that can be obtained when the audio-associated encoded layer is decoded; and a playback unit which plays back the decoded video image decoded by the video image decoding unit and the audio corrected by the correcting unit.
 2. The playback apparatus according to claim 1, further comprising a storage unit which stores an audio correction amount for correcting the audio, wherein the correcting unit corrects the audio with a correction amount obtained by multiplying the audio correction amount by the field of view ratio.
 3. The playback apparatus according to claim 2, wherein the memory further storing an instruction which causes the at least one processor to perform operations of an identifying unit of the playback apparatus which identifies a genre of a content of the hierarchically-encoded video image data, wherein the storage unit stores an audio correction amount associated with a genre, and the correcting unit corrects the audio using the audio correction amount associated with the genre identified by the identifying unit.
 4. The playback apparatus according to claim 1, wherein the decoding unit uses a result of decoding an encoded layer, among the plurality of encoded layers, different from the chosen encoded layer to decode the chosen encoded layer.
 5. A method for controlling a playback apparatus that plays back hierarchically-encoded video image data from which decoded video images of individual encoded layers can be obtained by selectively decoding the encoded layers and audio data associated with a predetermined encoded layer of the hierarchically-encoded video image data, the method comprising: receiving the hierarchically-encoded video image data and the audio data; specifying the audio-associated encoded layer with which the audio data is associated, from among a plurality of encoded layers contained in the hierarchically-encoded video image data received in the receiving; acquiring information on a field of view of the decoded video images that can be obtained when the plurality of encoded layers are individually decoded; decoding an encoded layer chosen from among the plurality of encoded layers; decoding the audio data; in the case where the chosen encoded layer is not the audio-associated encoded layer, correcting an audio decoded in the audio decoding using a ratio of the field of view of a decoded video image decoded in the video image decoding to the field of view of a decoded video image that can be obtained when the audio-associated encoded layer is decoded; and playing back the decoded video image decoded in the video image decoding and the audio corrected in the correcting.
 6. A non-transitory computer-readable storage medium storing a computer-executable program for causing a computer to perform each step of the method for controlling the playback apparatus according to claim 5.
 7. The method according to claim 5, further comprising storing an audio correction amount for correcting the audio, wherein, in the correcting, the audio with a correction amount obtained by multiplying the audio correction amount by the field of view ratio is corrected.
 8. The method according to claim 7, further comprising identifying a genre of a content of the hierarchically-encoded video image data, wherein, in the storing, an audio correction amount associated with a genre is stored, and in the correcting, the audio using the audio correction amount associated with the genre identified in the identifying is corrected.
 9. The method according to claim 5, wherein, in the decoding, a result of decoding an encoded layer, among the plurality of encoded layers, different from the chosen encoded layer to decode the chosen encoded layer is used.
 10. A video image processing apparatus comprising: at least one processor; and a memory which is coupled to the at least one processor and stores instructions which cause the at least one processor to perform operations of following units of the video image processing apparatus: a receiving unit which receives hierarchically-encoded video image data, from which decoded video images of individual encoded layers can be obtained by selectively decoding the encoded layers, and audio data associated with a predetermined encoded layer of the hierarchically-encoded video image data; a first decoding unit which decodes a encoded layer chosen from a plurality of encoded layers contained in the hierarchically-encoded video image data; a second decoding unit which decodes audio data; an acquiring unit which acquires information on a field of view of a decoded video that can be obtained when the plurality of encoded layers are individually decoded; and a correcting unit which, in the case where audio data is not associated with the chosen encoded layer decoded by the first decoding unit, corrects an audio decoded by the second decoding unit based on information on the field of view of a decoded video image that can be obtained when an encoded layer, with which audio data is associated, is decoded and information on the field of view of the chosen encoded layer decoded by the first decoding unit.
 11. The video image processing apparatus according to claim 10, wherein the correcting unit corrects, in a case where audio data is not associated with the chosen encoded layer decoded by the first decoding unit, the audio decoded by the second decoding unit based on a ratio of the field of view of the decoded video image that can be obtained when the encoded layer, with which audio data is associated, is decoded and the field of view of the chosen encoded layer decoded by the first decoding unit.
 12. The video image processing apparatus according to claim 11, further comprising a storage unit which stores an audio correction amount for correcting the audio, wherein the correcting unit corrects the audio with a correction amount obtained by multiplying the audio correction amount by the field of view ratio.
 13. The video image processing apparatus according to claim 12, wherein the memory further storing an instruction which causes the at least one processor to perform operations of an identifying unit of the playback apparatus which identifies a genre of a content of the hierarchically-encoded video image data, wherein the storage unit stores an audio correction amount associated with a genre, and the correcting unit corrects the audio using the audio correction amount associated with the genre identified by the identifying unit.
 14. The video image processing apparatus according to claim 10, wherein the first decoding unit uses a result of decoding an encoded layer, among the plurality of encoded layers, different from the chosen encoded layer to decode the chosen encoded layer.
 15. The video image processing apparatus according to claim 10, wherein the correcting unit corrects at least treble, bass, surround, or auditory lateralization of the audio.
 16. The video image processing apparatus according to claim 10, wherein the correcting unit corrects, in a case where a plurality of layers each associated with audio data are included in the plurality of encoded layers, the audio data associated with a layer that a decoded video image, which can be obtained when the layer is decoded, has a field of view closer to the field of view of the chosen encoded layer decoded by the first decoding unit.
 17. The video image processing apparatus according to claim 10, wherein the audio data is associated with an encoded layer, of which a decoded video image, that can be obtained when the encoded layer is decoded, has the widest field of view, of the plurality of encoded layers.
 18. A method for controlling a video image processing apparatus comprising: receiving hierarchically-encoded video image data, from which decoded video images of individual encoded layers can be obtained by selectively decoding the encoded layers, and audio data associated with a predetermined encoded layer of the hierarchically-encoded video image data; decoding a encoded layer chosen from a plurality of encoded layers contained in the hierarchically-encoded video image data; decoding audio data; acquiring information on a field of view of a decoded video that can be obtained when the plurality of encoded layers are individually decoded; and in the case where audio data is not associated with the chosen encoded layer decoded in the first decoding, correcting an audio decoded in the second decoding based on information on the field of view of a decoded video image that can be obtained when an encoded layer, with which audio data is associated, is decoded and information on the field of view of the chosen encoded layer decoded in the first decoding.
 19. The method according to claim 18, wherein, in the correcting, in a case where audio data is not associated with the chosen encoded layer decoded in the first decoding, the audio decoded in the second decoding is corrected based on a ratio of the field of view of the decoded video image that can be obtained when the encoded layer, with which audio data is associated, is decoded and the field of view of the chosen encoded layer decoded in the first decoding.
 20. The method according to claim 19, further comprising storing an audio correction amount for correcting the audio, wherein, in the correcting, the audio with a correction amount obtained by multiplying the audio correction amount by the field of view ratio is corrected.
 21. The method according to claim 20, further comprising identifying a genre of a content of the hierarchically-encoded video image data, wherein, in the storing, an audio correction amount associated with a genre is stored, and in the correcting, the audio using the audio correction amount associated with the genre identified in the identifying is corrected.
 22. The method according to claim 18, wherein, in the first decoding, a result of decoding an encoded layer, among the plurality of encoded layers, different from the chosen encoded layer to decode the chosen encoded layer is used.
 23. The method according to claim 18, wherein, in the correcting, at least treble, bass, surround, or auditory lateralization of the audio is corrected.
 24. The method according to claim 18, wherein, in the correcting, in a case where a plurality of layers each associated with audio data are included in the plurality of encoded layers, the audio data associated with a layer, that a decoded video image, which can be obtained when the layer is decoded, has a field of view closer to the field of view of the chosen encoded layer decoded in the first decoding, is corrected.
 25. The method according to claim 18, wherein the audio data is associated with an encoded layer, of which a decoded video image, that can be obtained when the encoded layer is decoded, has the widest field of view, of the plurality of encoded layers.
 26. A non-transitory computer-readable storage medium storing a computer-executable program for causing a computer to perform each step of the method for controlling the video image processing apparatus according to claim 18.